This dataset is publicly available for research. The details are described in [Cortez et al., 2009]: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at: http://dx.doi.org/10.1016/j.dss.2009.05.016
This report explores a dataset containing quality ratings and attributes of about 5,000 white wines. The inputs are physicochemical test results (e.g. pH or citric acid) and the output is a sensory score of the wine quality (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).
The dataset is related to a white variant of the Portuguese “Vinho Verde” wine. For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
The dataset consists of 12 variables about attributes, with almost 4,900 observations.
For more information, read [Cortez et al., 2009].
Input variables (based on physicochemical tests):
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2 - volatile acidity: the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavour to wines
4 - residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet
5 - chlorides: the amount of salt in the wine
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of the wine
7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11 - alcohol: the percent alcohol content of the wine
Output variable (based on sensory data): 12 - quality (score between 0 and 10)
## [1] 4898 13
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
Observations from the Summary
I will not focus on all the variables. I exclude density, which depends on the combination of alcohol and sugar; free sulfur dioxide, which is part of total sulfur dioxide (total SO2 = free SO2 + bound SO2, reference); and fixed acidity, which measures acidity much as pH does. According to this site, “Fixed acidity is measured as total acidity minus volatile acidity. Generally, pH is a quantitative assessment of fixed acidity.”
Among the remaining parameters, we can already observe some interesting facts. Details for each parameter are given in the single-variable study.
- The mean residual sugar is 6.391 g/L, but the maximum is 65.8 g/L. That wine is above the 45 g/L threshold for sweet wines and is an outlier.
- The mean level of chlorides is 0.046, with a maximum of 0.346. This maximum might be an outlier.
- The mean total sulfur dioxide is 138 ppm, and about 75% of the wines are at or above 108 ppm (the first quartile). Above 50 ppm, sulfur dioxide affects the taste of the wine, so this parameter might have an impact on the quality.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
The most common quality score is 6, and the observed scores range from 3 to 9. The quality distribution is roughly normal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
The mean volatile acidity is 0.2782 g/L (median 0.26), with a maximum of 1.1 g/L, which corresponds to the US legal limit for white wine. Most wines have a volatile acidity between 0.15 g/L and 0.35 g/L, with a high peak at 0.25 g/L (first graphic) and a roughly normal distribution with a slight right skew. By reducing the binwidth (second graphic), we observe that there is not a single peak but a high count of wines between 0.24 g/L and 0.28 g/L. We know that too much volatile acidity is bad for the wine, so it will be interesting to study the link between this parameter and the quality.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
The mean level of citric acid is 0.3342 g/L (median 0.32). Some wines contain no citric acid, but they are rare: the first quartile is already at 0.27 g/L. Most wines have between 0.2 g/L and 0.55 g/L, with a roughly normal distribution, and only a few wines have a concentration higher than 1 g/L. The presence of citric acid is a benefit, bringing ‘freshness’ and flavour to wines, and it will be interesting to see its impact on the quality.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
The summary table shows one outlier, 65.8 g/L, so I limited the data to wines with residual sugar less than or equal to 45 g/L. The distribution is skewed, so I used a log10 x-axis for a second graph.
On the log10 scale, the residual sugar concentration has a bimodal distribution, meaning that there are two different groups: dry (not sweet) white wines (1 to 4 g/L) and slightly sweet white wines (4 to 19 g/L).
For this reason, a new variable sugar_category is created with the ifelse function, using a limit of 4 g/L: wines below 4 g/L are labelled dry, and wines at 4 g/L or above are labelled slight_sweet.
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.0 0.27 0.36 20.7 0.045
## 2 2 6.3 0.30 0.34 1.6 0.049
## 3 3 8.1 0.28 0.40 6.9 0.050
## 4 4 7.2 0.23 0.32 8.5 0.058
## 5 5 7.2 0.23 0.32 8.5 0.058
## 6 6 8.1 0.28 0.40 6.9 0.050
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 45 170 1.0010 3.00 0.45 8.8
## 2 14 132 0.9940 3.30 0.49 9.5
## 3 30 97 0.9951 3.26 0.44 10.1
## 4 47 186 0.9956 3.19 0.40 9.9
## 5 47 186 0.9956 3.19 0.40 9.9
## 6 30 97 0.9951 3.26 0.44 10.1
## quality sugar_category
## 1 6 slight_sweet
## 2 6 dry
## 3 6 slight_sweet
## 4 6 slight_sweet
## 5 6 slight_sweet
## 6 6 slight_sweet
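The sugar_category column shown above could be built along these lines. This is a minimal sketch, not necessarily the original code: the small `wine` data frame is a hypothetical stand-in whose first four residual sugar values are taken from the rows printed above, and the last two are invented to show the behaviour at the 4 g/L limit.

```r
# Hypothetical stand-in for the wine data: only the column needed here.
wine <- data.frame(residual.sugar = c(20.7, 1.6, 6.9, 8.5, 4.0, 3.9))

# Wines strictly below 4 g/L are "dry"; wines at or above 4 g/L are "slight_sweet".
wine$sugar_category <- ifelse(wine$residual.sugar < 4, "dry", "slight_sweet")

print(wine)
```

A wine at exactly 4 g/L falls in the slight_sweet group, which matches the stated rule that 4 g/L is excluded from the dry group.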
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
The maximum value (0.346) might not be an outlier, but very few wines have a level above 0.10. The next graphic focuses on the wines with a level lower than or equal to 0.10.
The mean level of chlorides is 0.046 (median 0.043), with most of the wines having a level between 0.025 and 0.06 and a roughly normal distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
As noted before, the mean total sulfur dioxide is 138 ppm (median: 134 ppm), with most of the wines between 60 and 220 ppm and a roughly normal distribution. Above 50 ppm, sulfur dioxide affects the taste of the wine, so this parameter might have an impact on the quality.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
All wines have a pH between 2.7 and 3.8, with a mean of 3.2. None of the wines is basic (no pH higher than 7).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
The mean level of sulphates is 0.49 g/L (median 0.47) with a normal distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
The alcohol level distribution is right-skewed, with a peak around 9.5% and a mean of 10.5%.
There are 4898 wines in the dataset, described by 11 physicochemical features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol) and one output variable, quality.
The main feature in the dataset is the quality. I would like to determine which features are influencing the quality of wine.
After researching information about wine, I think that residual sugar, alcohol, volatile acidity and citric acid contribute most to the quality.
I have observed that the distribution of residual sugar is bimodal, meaning that there are two different groups. I have created a new variable sugar_category with 2 classes: dry (less than 4 g/L) and slight_sweet (4 g/L or more).
The distribution of residual sugar is skewed, so I used a log10 scale, which revealed the bimodal shape.
I select the variables chosen at the end of the Summary part: I exclude density (which depends on the combination of alcohol and sugar), free sulfur dioxide (part of total sulfur dioxide) and fixed acidity (similar to pH). We can measure the correlation coefficients to check these choices.
##
## Pearson's product-moment correlation
##
## data: wine$density and wine$alcohol
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7908646 -0.7689315
## sample estimates:
## cor
## -0.7801376
##
## Pearson's product-moment correlation
##
## data: wine$density and wine$residual.sugar
## t = 107.87, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8304732 0.8470698
## sample estimates:
## cor
## 0.8389665
##
## Pearson's product-moment correlation
##
## data: wine$free.sulfur.dioxide and wine$total.sulfur.dioxide
## t = 54.645, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5977994 0.6326026
## sample estimates:
## cor
## 0.615501
##
## Pearson's product-moment correlation
##
## data: wine$fixed.acidity and wine$pH
## t = -32.934, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4485154 -0.4026542
## sample estimates:
## cor
## -0.4258583
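Output of this form can be produced with base R's `cor.test`, which computes a Pearson correlation by default. The following is a minimal sketch on synthetic data: the `wine_demo` data frame and the coefficients used to generate its `density` column are assumptions made for illustration only, not values from the wine dataset.

```r
set.seed(1)
n <- 200

# Synthetic stand-ins for two of the real columns.
alcohol <- runif(n, min = 8, max = 14)
residual.sugar <- runif(n, min = 1, max = 20)

# Assumed toy relationship: density falls with alcohol and rises with sugar.
density <- 1 - 0.001 * alcohol + 0.0004 * residual.sugar + rnorm(n, sd = 0.0005)
wine_demo <- data.frame(alcohol, residual.sugar, density)

# Pearson's product-moment correlation test for each pair.
res_alcohol <- cor.test(wine_demo$density, wine_demo$alcohol)
res_sugar <- cor.test(wine_demo$density, wine_demo$residual.sugar)

print(res_alcohol$estimate)
print(res_sugar$estimate)
```

Printing `res_alcohol` itself, rather than only its estimate, shows the full report with the t statistic, degrees of freedom, p-value and confidence interval.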
## volatile.acidity citric.acid residual.sugar chlorides
## 1 0.27 0.36 20.7 0.045
## 2 0.30 0.34 1.6 0.049
## 3 0.28 0.40 6.9 0.050
## 4 0.23 0.32 8.5 0.058
## 5 0.23 0.32 8.5 0.058
## 6 0.28 0.40 6.9 0.050
## total.sulfur.dioxide pH sulphates alcohol quality sugar_category
## 1 170 3.00 0.45 8.8 6 slight_sweet
## 2 132 3.30 0.49 9.5 6 dry
## 3 97 3.26 0.44 10.1 6 slight_sweet
## 4 186 3.19 0.40 9.9 6 slight_sweet
## 5 186 3.19 0.40 9.9 6 slight_sweet
## 6 97 3.26 0.44 10.1 6 slight_sweet
## Warning in ggcorr(sample, hjust = 0.75, size = 3, label = TRUE, label_size
## = 3, : data in column(s) 'sugar_category' are not numeric and were ignored
We can see some correlations, such as:
* total sulfur dioxide vs residual sugar (moderate positive correlation)
* alcohol vs residual sugar (moderate negative correlation)
* alcohol vs chlorides (small negative correlation)
* alcohol vs total sulfur dioxide (moderate negative correlation)
* quality vs alcohol (moderate positive correlation)
First, I will look at the correlations observed in the scatterplot matrix between the objective parameters (excluding quality). Second, I want to look more closely at plots involving quality and some other variables such as alcohol, volatile acidity, residual sugar and citric acid. Indeed, in the original paper (http://dx.doi.org/10.1016/j.dss.2009.05.016) these factors were considered to take part in the model of quality.
When comparing residual sugar with total sulfur dioxide or alcohol, the first plots suffer from overplotting, a poorly suited x scale and one outlier. Adding jitter and transparency, changing the x scale to log10, changing the y limits and excluding the sugar outlier (with subset) lets us see the moderate correlations calculated before. I add a linear regression line to visualise them better. Moreover, I have created two groups based on the sugar level, with a limit of 4 g/L. With the vertical line we can observe that for total sulfur dioxide vs residual sugar both groups have the same tendency, while for alcohol vs residual sugar the two groups seem to have different patterns.
When comparing alcohol with total sulfur dioxide or chlorides, the first plots suffer from overplotting and a large spread of points. Adding jitter, transparency and smaller points, and changing the y limits, lets us see the correlations calculated before. I add linear regression lines to visualise them better. For chlorides, the graphic leaves out the top 5% of values, and for total sulfur dioxide it leaves out the top 1% of values.
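One way to leave the top 5% of values out of a plot is to compute the 95th percentile with `quantile` and keep only the values at or below it. The sketch below uses an invented `chlorides` vector (95 typical values plus 5 extreme ones); it illustrates the idea and is not necessarily how the graphics in this report were produced.

```r
set.seed(42)

# Invented data: 95 typical chloride levels and 5 unusually high ones.
chlorides <- c(rnorm(95, mean = 0.045, sd = 0.01), runif(5, min = 0.15, max = 0.35))

# 95th percentile of the values; everything above it is the "top 5%".
cutoff <- quantile(chlorides, probs = 0.95)

# Keep only the values at or below the cutoff for plotting.
trimmed <- chlorides[chlorides <= cutoff]

print(cutoff)
print(length(trimmed))
```

Using `probs = 0.99` instead would leave out only the top 1%, as described for total sulfur dioxide.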
## subtable$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.55 10.45 10.35 11.00 12.60
## --------------------------------------------------------
## subtable$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.40 10.10 10.15 10.75 13.50
## --------------------------------------------------------
## subtable$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.000 9.200 9.500 9.809 10.300 13.600
## --------------------------------------------------------
## subtable$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 9.60 10.50 10.58 11.40 14.00
## --------------------------------------------------------
## subtable$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.60 10.60 11.40 11.37 12.30 14.20
## --------------------------------------------------------
## subtable$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 11.00 12.00 11.64 12.60 14.00
## --------------------------------------------------------
## subtable$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 12.40 12.50 12.18 12.70 12.90
##
## Pearson's product-moment correlation
##
## data: subtable$alcohol and subtable$quality
## t = 33.858, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4126015 0.4579941
## sample estimates:
## cor
## 0.4355747
There is a correlation between alcohol and quality. An alcohol level of about 11% seems to act as a threshold separating the lower quality wines (3 to 6) from the higher quality wines (7 to 9).
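Per-quality summaries like the ones above can be obtained in base R with `by`, which applies a function to each group. The following is a minimal sketch on a small invented `subtable`; the alcohol values are illustrative and only loosely inspired by the quartiles printed above.

```r
# Invented miniature version of the data: three wines per quality level.
subtable <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  alcohol = c(9.2, 9.5, 10.3, 9.6, 10.5, 11.4, 10.6, 11.4, 12.3)
)

# Five-number summary plus mean of alcohol within each quality level.
alcohol_by_quality <- by(subtable$alcohol, subtable$quality, summary)
print(alcohol_by_quality)

# Median alcohol per quality level, as a named vector.
medians <- tapply(subtable$alcohol, subtable$quality, median)
print(medians)

# Pearson correlation between alcohol and quality.
quality_cor <- cor.test(subtable$alcohol, subtable$quality)
print(quality_cor$estimate)
```

Comparing the group medians against a candidate threshold is a quick way to check a claim such as the 11% separation before looking at the boxplots.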
## subtable$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1700 0.2375 0.2600 0.3332 0.4125 0.6400
## --------------------------------------------------------
## subtable$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1100 0.2700 0.3200 0.3812 0.4600 1.1000
## --------------------------------------------------------
## subtable$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.100 0.240 0.280 0.302 0.340 0.905
## --------------------------------------------------------
## subtable$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2000 0.2500 0.2606 0.3000 0.9650
## --------------------------------------------------------
## subtable$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.1900 0.2500 0.2628 0.3200 0.7600
## --------------------------------------------------------
## subtable$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.2000 0.2600 0.2774 0.3300 0.6600
## --------------------------------------------------------
## subtable$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.240 0.260 0.270 0.298 0.360 0.360
The volatile acidity corresponds to the amount of acetic acid in wine, and at high levels it can lead to an unpleasant taste. I was expecting that higher quality wines would have a lower level of volatile acidity. Surprisingly, all quality groups cover a similar range of volatile acidity. In this dataset, the volatile acidity does not seem to influence the quality rating.
The outlier of residual sugar is directly excluded from the graphic.
## subtable$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.587 4.600 6.393 10.700 16.200
## --------------------------------------------------------
## subtable$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.300 2.500 4.628 7.100 17.550
## --------------------------------------------------------
## subtable$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.800 7.000 7.335 11.500 23.500
## --------------------------------------------------------
## subtable$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.700 5.300 6.442 9.900 65.800
## --------------------------------------------------------
## subtable$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.700 3.650 5.186 7.325 19.250
## --------------------------------------------------------
## subtable$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.800 2.100 4.300 5.671 8.200 14.800
## --------------------------------------------------------
## subtable$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.60 2.00 2.20 4.12 4.20 10.60
##
## Pearson's product-moment correlation
##
## data: subtable$residual.sugar and subtable$quality
## t = -6.8603, df = 4896, p-value = 7.724e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.12524103 -0.06976101
## sample estimates:
## cor
## -0.09757683
Higher quality wines do not seem to have a particular level of residual sugar. We can say that the residual sugar level is not a major component influencing the wine quality (see the correlation coefficient). However, I have split the wines in the dataset into two categories depending on their residual sugar level. We observe that both groups have a similar distribution of quality (graphic below) and that their means are very close.
## subtable$sugar_category: dry
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 5.00 6.00 5.95 7.00 9.00
## --------------------------------------------------------
## subtable$sugar_category: slight_sweet
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.824 6.000 9.000
## subtable$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2100 0.2575 0.3450 0.3360 0.3850 0.4700
## --------------------------------------------------------
## subtable$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1900 0.2900 0.3042 0.4000 0.8800
## --------------------------------------------------------
## subtable$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2400 0.3200 0.3377 0.4100 1.0000
## --------------------------------------------------------
## subtable$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.270 0.320 0.338 0.380 1.660
## --------------------------------------------------------
## subtable$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.2800 0.3100 0.3256 0.3600 0.7400
## --------------------------------------------------------
## subtable$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0400 0.2800 0.3200 0.3265 0.3600 0.7400
## --------------------------------------------------------
## subtable$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.290 0.340 0.360 0.386 0.450 0.490
As described before, citric acid can add ‘freshness’ and flavor to wines. I was expecting that higher quality wines would have a higher level of citric acid. However, the mean citric acid level is quite constant across quality levels (around 0.33), and the variance seems to decrease as quality increases.
I focused on the correlations between quality and:
* residual sugar: the level of residual sugar is similar among the quality groups. The residual sugar level does not seem to be a major component influencing the wine quality.
* volatile acidity: there is little variation of volatile acidity between the different quality groups.
* citric acid: all the quality groups have roughly the same average citric acid level. With higher quality, the variance is reduced.
* alcohol: there is a correlation between quality and alcohol. The higher the level of alcohol, the higher the quality.
There is a correlation between:
- the alcohol and the total sulfur dioxide
- the alcohol and the chlorides
- the alcohol and the residual sugar
However, we observed two groups in the correlation of alcohol vs residual sugar. These two groups also correspond to the two groups observed in the bimodal distribution of residual sugar, with a limit of 4 g/L.
The strongest relationship observed was between alcohol and quality.
In the previous part, we observed that alcohol vs residual sugar seems to show two behaviours, separated at 4 g/L of residual sugar. In the first part I created the variable sugar_category, separating the wines into two categories. The following graphic explores whether the behaviour differs depending on the sugar_category.
## # A tibble: 2 x 3
## sugar_category alcohol.median alcohol.mean
## <chr> <dbl> <dbl>
## 1 dry 11.0 11.0
## 2 slight_sweet 9.80 10.2
In the ‘dry’ wines, alcohol increases with residual sugar, while the ‘slight sweet’ wines show the opposite correlation. In other words, increasing the level of sugar goes with a higher level of alcohol up to a “breaking point” of 4 g/L. Beyond this limit, the sugar acts as an inhibitor of the alcohol. The idea is thus to analyse the variables in the rest of the study by separating these two groups. We also observe that the dry wines contain on average more alcohol than the slight sweet wines.
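The opposite trends on either side of the 4 g/L limit can be checked by fitting a separate linear regression of alcohol on residual sugar within each sugar category and comparing the signs of the slopes. This is a minimal sketch on invented numbers, chosen so that the two groups have opposite trends; it says nothing about the size of the effect in the real data.

```r
# Invented data: three wines below the 4 g/L limit and three above it.
wine_demo <- data.frame(
  residual.sugar = c(1, 2, 3, 5, 10, 15),
  alcohol = c(10.5, 11.0, 11.5, 11.0, 10.0, 9.0)
)
wine_demo$sugar_category <- ifelse(wine_demo$residual.sugar < 4, "dry", "slight_sweet")

# Fit alcohol ~ residual.sugar separately in each category and keep the slope.
slopes <- sapply(split(wine_demo, wine_demo$sugar_category), function(group) {
  coef(lm(alcohol ~ residual.sugar, data = group))[["residual.sugar"]]
})

print(slopes)
```

A positive slope for dry and a negative slope for slight_sweet would be consistent with the pattern described above, although a regression slope alone describes an association, not a mechanism.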
## # A tibble: 2 x 3
## sugar_category chloride.median chlorides.mean
## <chr> <dbl> <dbl>
## 1 dry 0.0400 0.0441
## 2 slight_sweet 0.0450 0.0470
Pearson correlation coefficient for each sugar category
## subtable$sugar_category: dry
## [1] -0.3921068
## --------------------------------------------------------
## subtable$sugar_category: slight_sweet
## [1] -0.3367922
Both sugar categories have a negative correlation between alcohol and chlorides, and both have a similar average chloride level. However, the Pearson correlation coefficient is stronger (larger in absolute value) in the dry wines than in the sweet wines. This suggests that the alcohol level of the dry wines might be more closely linked to the chlorides than that of the sweet wines.
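A per-category Pearson coefficient can be computed by splitting the data on sugar_category and calling `cor` within each part. The sketch below uses an invented `subtable` with five wines per category; the values are illustrative assumptions, so the resulting coefficients will not match the ones reported above.

```r
# Invented data: five wines in each sugar category.
subtable <- data.frame(
  sugar_category = rep(c("dry", "slight_sweet"), each = 5),
  alcohol = c(12.0, 11.5, 11.0, 10.5, 10.0, 11.0, 10.4, 10.1, 9.6, 9.2),
  chlorides = c(0.030, 0.036, 0.041, 0.044, 0.052, 0.038, 0.047, 0.043, 0.051, 0.049)
)

# Pearson correlation between alcohol and chlorides within each category.
cor_by_category <- sapply(split(subtable, subtable$sugar_category), function(group) {
  cor(group$alcohol, group$chlorides)
})

print(cor_by_category)
```

Because both coefficients are negative, the comparison between groups should be made on their absolute values.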
## # A tibble: 2 x 3
## sugar_category total.sulfur.median total.sulfur.mean
## <chr> <dbl> <dbl>
## 1 dry 117 120
## 2 slight_sweet 151 152
Pearson correlation coefficient for each sugar category
## subtable$sugar_category: dry
## [1] -0.2257001
## --------------------------------------------------------
## subtable$sugar_category: slight_sweet
## [1] -0.4608308
The sweet wines have a higher level of total sulfur dioxide than the dry wines. By separating the sugar categories, we can say that for the dry wines there is little or no correlation between alcohol and total sulfur dioxide (|coefficient| < 0.3), while for the sweet wines there is a negative correlation between alcohol and the total sulfur dioxide level.
Pearson correlation coefficient for each sugar category
## subtable$sugar_category: dry
## [1] 0.453321
## --------------------------------------------------------
## subtable$sugar_category: slight_sweet
## [1] 0.4292139
Both sugar categories have a positive correlation between alcohol and quality, but the Pearson correlation coefficient is slightly higher in the dry wines than in the sweet wines. This suggests that the quality of the dry wines might be more influenced by the alcohol than the quality of the sweet wines. The presence of sugar might “hide” the difference of quality between wines in the sweet category, or the decrease of alcohol observed in the sweet wines might make these wines of lower quality.
Previously, we saw that the influence of the chlorides on the alcohol is more pronounced in the dry wines than in the sweet wines. In this graphic, we see that the chloride levels are more clearly separated in the dry wines than in the sweet wines. Because alcohol and chlorides are negatively correlated, we see that for the same quality in dry wines, the wines with higher chlorides have a lower alcohol level.
Previously, we saw that the influence of the total sulfur dioxide on the alcohol is more pronounced in the sweet wines than in the dry wines. In this graphic, we see that the total sulfur dioxide levels are more clearly separated in the sweet wines than in the dry wines. Because alcohol and total sulfur dioxide are negatively correlated, we see that for the same quality in sweet wines, the wines with higher total sulfur dioxide have a lower alcohol level.
We observe that in this dataset there are two types of behaviour depending on the sugar level. As previously observed, the alcohol level is correlated with the quality, which is the strongest correlation found. The correlation is stronger in the dry wines. Dry wines have more alcohol, and in this category the chloride level has an important negative influence on the alcohol. Sweet wines have less alcohol, and in this category the total sulfur dioxide level has an important negative influence on the alcohol.
The highest correlation value is 0.45, for alcohol vs quality in dry wines. This is not a high correlation, so alcohol alone cannot be used to predict quality.
By plotting the distribution of residual sugar, we observed that the wines can be split into two groups, “dry” and “slight sweet” wines, with a limit of 4 g/L. We found that each group behaves differently.
Alcohol level and quality are correlated in both the dry and the slight sweet groups, with coefficients of 0.45 and 0.43 respectively. This means that the more alcoholic the wine is, the better the rater tends to find it. The influence of alcohol is slightly more pronounced in the dry wines. However, a coefficient of about 0.4 is not a high correlation, so alcohol alone cannot be used to predict quality.
For each sugar category, we observed other variables that might influence the level of alcohol for a given quality. For the dry wines, a high level of chlorides seems to go with a lower level of alcohol. For the slight sweet wines, a high level of total sulfur dioxide seems to go with a lower level of alcohol.
The analysis of this dataset of white wines leads us to these conclusions:
* There are 2 groups of wines based on their residual sugar.
* Increasing the amount of residual sugar increases the level of alcohol up to a breaking point. Beyond this breaking point, additional sugar acts as an inhibitor of the alcohol.
* The dry white wines contain more alcohol than the slightly sweet white wines.
* The chlorides decrease the alcohol level, with a more pronounced effect in the dry wines.
* The total sulfur dioxide decreases the alcohol level, with a more pronounced effect in the slightly sweet white wines.
* Alcohol level and quality are positively correlated, with a stronger effect in the dry wines.
* Surprisingly, the volatile acidity level, the residual sugar level and the citric acid level do not appear to influence the quality.
The levels of alcohol and residual sugar can be controlled during the production process, and sulfur dioxide is added. However, the chloride concentration in the wine is influenced by terroir [ref]. One idea is to add sugar step by step so as to stay below the breaking point, reaching the highest possible level of alcohol along the way. At the same time, reducing the amount of sulfur dioxide could improve the quality of the wine. However, we can also conclude that the experts’ quality rating is mostly based on their personal taste, or could depend on other variables such as the year of production, the grape types or the terroir.
For further exploration, the same analysis could be done on red wines and the results compared with this white wine dataset.